Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

WIP: Move DateTime native code into managed #21848

Closed
wants to merge 4 commits into from

Conversation

filipnavara
Copy link
Member

Just testing build and unit tests now. If it works out then the code could be shared between CoreCLR and CoreRT completely.

@jkotas
Copy link
Member

jkotas commented Jan 7, 2019

Some of this is implemented as FCalls to avoid PInvoke overhead. This would need performance testing as well.

@filipnavara
Copy link
Member Author

Some of this is implemented as FCalls to avoid PInvoke overhead. This would need performance testing as well.

I am aware of that. The history of the repository shows that has been done in other places. However, the BOTR explicitly states

If the only reason you're defining a FCall method is to call a native Win32 method, you should be using P/Invoke to call Win32 directly.

Maybe it's time to have some updated guidance on that.

@filipnavara
Copy link
Member Author

@dotnet-bot test Windows_NT arm Cross Debug Innerloop Build
@dotnet-bot test Windows_NT arm64 Cross Debug Innerloop Build
@dotnet-bot test Windows_NT x64 Release CoreFX Tests

@filipnavara
Copy link
Member Author

I've done some preliminary benchmarks and it shows some rather interesting things. Firstly, my initial implementation with LazyInitializer and delegates is about 2x slower than the baseline. Switching to an approach with statically initialized s_systemSupportsPreciseSystemTime variable and simple if condition cuts down the difference quite a lot:

Method Mean Error StdDev
UtcNow [Old] 25.32 ns 0.4442 ns 0.3938 ns
UtcNow [New] 30.75 ns 0.6206 ns 0.5805 ns

Now, these numbers are not completely representative since it was run on .NET Core 2.1.6 runtime with a code copied to a separate BenchmarkDotNet project. Perhaps there's a better way to benchmark something like this?

Another interesting observation is that the GetSystemTimePreciseAsFileTime is nearly twice as slow on my system as GetSystemTimeAsFileTime.

@Suchiman
Copy link

Suchiman commented Jan 8, 2019

@filipnavara

Another interesting observation is that the GetSystemTimePreciseAsFileTime is nearly twice as slow on my system as GetSystemTimeAsFileTime.

That seems to be expected: #9736

@filipnavara
Copy link
Member Author

filipnavara commented Jan 8, 2019

@Suchiman Thanks, I missed that remark in the PR.

Looks like .NET Core 2.1.6 doesn't properly optimize the static readonly bool variable trick. I will retry the benchmarks on newer runtime.

@filipnavara
Copy link
Member Author

I had some flaws in my benchmark. This is current state with .NET Core 3.0 Preview 1 runtime:

Method Mean Error StdDev
UtcNow [Old] 25.34 ns 0.1404 ns 0.1313 ns
UtcNow [New] 27.81 ns 0.2964 ns 0.2628 ns

Is this worth pursuing further? Maybe there are ways to squeeze the last few nanoseconds back.

@danmoseley
Copy link
Member

@filipnavara UtcNow is often on the hot path, yes I think it is worth investigating further perf improvements.

@filipnavara
Copy link
Member Author

Well, I can scale back the PR and keep the FCalls and move just the rest to shared partition. I am not sure if I will be able to squeeze any more performance out of the P/Invoke approach.

@jkotas
Copy link
Member

jkotas commented Jan 9, 2019

The performance is closer than what I thought it is going to be. I will take a look at this in detail when I get a chance.

@filipnavara
Copy link
Member Author

Thanks! Any advice or help is appreciated.

@filipnavara
Copy link
Member Author

There are some particularly bad things in the code gen, which could likely be eliminated:

00007ff8`c8b1f667 488945b0        mov     qword ptr [rbp-50h],rax
00007ff8`c8b1f66b eb00            jmp     00007ff8`c8b1f66d // Jumping to next instruction, really?
00007ff8`c8b1f66d 488b45b0        mov     rax,qword ptr [rbp-50h] // Loading whatever we wrote there two instructions ago

@mikedn
Copy link

mikedn commented Jan 9, 2019

There are some particularly bad things in the code gen, which could likely be eliminated:

Hmm, can you post more code? I've seen such jumps before but they're not common.

@filipnavara
Copy link
Member Author

@mikedn I extracted the PInvoke alone into a small function

        public unsafe void JustPInvoke()
        {
            long timestamp;
            Interop.Kernel32.GetSystemTimePreciseAsFileTime(&timestamp);
        }

and looked at the generated code (.NET Core 3.0.0-preview-27122-01 (CoreCLR 4.6.27121.03, CoreFX 4.7.18.57103), 64bit RyuJIT):

; DateTimeTest.DateTimeBenchmark.JustPInvoke()
       push    rbp
       push    r15
       push    r14
       push    r13
       push    r12
       push    rdi
       push    rsi
       push    rbx
       sub     rsp,78h
       lea     rbp,[rsp+0B0h]
       xor     edx,edx
       mov     qword ptr [rbp-40h],rdx
       mov     qword ptr [rbp-48h],rdx
       mov     qword ptr [rbp+10h],rcx
       lea     rcx,[rbp-80h]
       mov     rdx,r10
       call    coreclr+0xad090
       mov     qword ptr [rbp-48h],rax
       mov     rcx,rsp
       mov     qword ptr [rbp-60h],rcx
       mov     rcx,rbp
       mov     qword ptr [rbp-50h],rcx
       lea     rcx,[rbp-40h]
       xor     r11d,r11d
       mov     rax,7FF89F1F39D8h
       mov     qword ptr [rbp-70h],rax
       lea     rax,[00007ff8`9f0affde]
       mov     qword ptr [rbp-58h],rax
       mov     rax,qword ptr [rbp-48h]
       lea     rdx,[rbp-80h]
       mov     qword ptr [rax+10h],rdx
       mov     rax,qword ptr [rbp-48h] ; This is loading a value that was already loaded 4 lines above
       mov     byte ptr [rax+0Ch],0
       call    qword ptr [00007ff8`9f1f3b70]
       mov     rdx,qword ptr [rbp-48h]
       mov     byte ptr [rdx+0Ch],1
       cmp     dword ptr [coreclr!g_CLREngineMetrics+0x9080
       je      M00_L00
       call    qword ptr [coreclr!CreateCLRProfiling+0x20b188
M00_L00:
       mov     rax,qword ptr [rbp-48h]
       mov     rdx,qword ptr [rbp-78h]
       mov     qword ptr [rax+10h],rdx
       jmp     M00_L01 ; Unnecessary jump
M00_L01:
       mov     rax,qword ptr [rbp-48h] ; Unnecessary load (probably not eliminated because of the jump?)
       mov     byte ptr [rax+0Ch],1
       lea     rsp,[rbp-38h]
       pop     rbx
       pop     rsi
       pop     rdi
       pop     r12
       pop     r13
       pop     r14
       pop     r15
       pop     rbp
       ret
...
; Total bytes of code 204

It could very well be something screwed up by BenchmarkDotNet or my .NET Core installation. Smallest repro code here:

using System;
using System.Runtime.InteropServices;

public class C {
        [DllImport("kernel32", ExactSpelling = true, SetLastError = false)]
        internal static unsafe extern void GetSystemTimePreciseAsFileTime(long* lpSystemTimeAsFileTime);

    	public unsafe void JustPInvoke()
        {
            long timestamp;
            GetSystemTimePreciseAsFileTime(&timestamp);
        }
}

@AndyAyersMS
Copy link
Member

You are probably looking at unoptimized tier-0 code; here's what I see for your repro.

; Assembly listing for method C:JustPInvoke():this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rbp based frame
; partially interruptible
; Final local variable assignments
;
;* V00 this         [V00    ] (  0,  0   )     ref  ->  zero-ref    this class-hnd
;  V01 loc0         [V01    ] (  1,  1   )    long  ->  [rbp-0x40]   do-not-enreg[X] must-init addr-exposed ld-addr-op
;  V02 OutArgs      [V02    ] (  1,  1   )  lclBlk (32) [rsp+0x00]   "OutgoingArgSpace"
;  V03 FramesRoot   [V03,T00] (  7,  7   )    long  ->  rsi         "Pinvoke FrameListRoot"
;  V04 PInvokeFrame [V04    ] (  8,  8   )     blk (64) [rbp-0x80]   do-not-enreg[X] addr-exposed "Pinvoke FrameVar"
;
; Lcl frame size = 104

G_M55676_IG01:
       push     rbp
       push     r15
       push     r14
       push     r13
       push     r12
       push     rdi
       push     rsi
       push     rbx
       sub      rsp, 104
       lea      rbp, [rsp+A0H]
       xor      rdx, rdx
       mov      qword ptr [rbp-40H], rdx

G_M55676_IG02:
       lea      rcx, bword ptr [rbp-78H]
       mov      rdx, r10
       call     CORINFO_HELP_INIT_PINVOKE_FRAME
       mov      rsi, rax
       mov      rcx, rsp
       mov      qword ptr [rbp-58H], rcx
       mov      rcx, rbp
       mov      qword ptr [rbp-48H], rcx
       lea      rcx, bword ptr [rbp-40H]
       xor      r11, r11
       mov      rax, 0xD1FFAB1E
       mov      qword ptr [rbp-68H], rax
       lea      rax, G_M55676_IG04
       mov      qword ptr [rbp-50H], rax
       lea      rax, bword ptr [rbp-78H]
       mov      qword ptr [rsi+16], rax
       mov      byte  ptr [rsi+12], 0

G_M55676_IG03:
       call     [C:GetSystemTimePreciseAsFileTime(long)]

G_M55676_IG04:
       mov      byte  ptr [rsi+12], 1
       cmp      dword ptr [(reloc)], 0
       je       SHORT G_M55676_IG05
       call     [CORINFO_HELP_STOP_FOR_GC]

G_M55676_IG05:
       mov      rax, bword ptr [rbp-70H]
       mov      qword ptr [rsi+16], rax

G_M55676_IG06:
       mov      byte  ptr [rsi+12], 1

G_M55676_IG07:
       lea      rsp, [rbp-38H]
       pop      rbx
       pop      rsi
       pop      rdi
       pop      r12
       pop      r13
       pop      r14
       pop      r15
       pop      rbp
       ret      

; Total bytes of code 157, prolog size 30 for method C:JustPInvoke():this

BDN should be getting you to tier-1 code, but you can force the issue via COMPlus_TieredCompilation=0.

@filipnavara
Copy link
Member Author

@AndyAyersMS Yeah, there's BDN bug filled for that (dotnet/BenchmarkDotNet#899). I was seeing some weird output with COMPlus_TieredCompilation=0 earlier, but likely due to some other mistake I made. Your output is way more readable than whatever I managed to get locally.

@Suchiman
Copy link

Suchiman commented Jan 9, 2019

@stephentoub
Copy link
Member

Another interesting observation is that GetSystemTimePreciseAsFileTime is nearly twice as slow

Yeah, we decided in #9736 that the performance difference was worth it for the accuracy.

@filipnavara
Copy link
Member Author

Closing for now, thanks for all the valuable feedback. The Unix part was superseded by #22383 and the P/Invoke performance improvements are discussed there and in #22320.

@filipnavara
Copy link
Member Author

Note to myself: This should be resurrected now that SuppressGCTransition attribute is available.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants